About the project

I am excited about this project because I expect to learn new skills in R.

My GitHub repository is at https://github.com/pyrysipila/IODS-project


Chapter 2: Linear regression


I have worked on data wrangling and linear regression. I think I have learned a lot. Please see below.

I will start by reading in the learning2014 data from my own data folder.

learning2014 <- read.csv("//atkk/home/p/pyrysipi/Documents/Tutkimus/Kurssit/Introduction to Open Data Science 2018/Data/learning2014.csv")

The data were collected from students at Jyväskylä University in 2014-2015. The aim was to assess how attitude and different learning approaches predict success on an introductory statistics course.

More information on the data can be found here: http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt

Below I will explore the structure and dimensions of the data.

#dimensions of the data:
dim(learning2014)
## [1] 166   7
#the data have 166 observations and 7 variables

#structure of the data
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
#First six rows of the data:
head(learning2014)
##   gender Age attitude     deep  stra     surf Points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21

Next, I will describe the data with plots and summaries of the variables.

# A scatter plot matrix
pairs(learning2014[-1], col = learning2014$gender)
library(ggplot2)

library(GGally)
p1 <- ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))
p1

summary(learning2014)
##  gender       Age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

From the plots and summaries above, it can be seen that the data consist of 110 women and 56 men. The mean age is 25.5 years. Age is unevenly distributed: some people are much older than average, but no one is much younger than average. In other words, the age distribution is right-skewed. The other variables have distributions that at least resemble the normal distribution.
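A quick way to see the right skew numerically, assuming the `learning2014` data frame loaded above: for a right-skewed variable the mean exceeds the median.

```r
# mean vs. median of Age: mean > median indicates right skew
summary(learning2014$Age)[c("Median", "Mean")]

# a histogram shows the long right tail directly
hist(learning2014$Age, breaks = 20, main = "Distribution of Age", xlab = "Age")
```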

Attitude has a moderate correlation with Points (r ≈ 0.4): a better attitude is associated with better success on the course. Deep (deep approach) has a moderate negative correlation (r ≈ -0.3) with surf (surface approach). This adds to the validity of these constructs, because one would expect the deep and surface approaches to be antagonistic. Correlations between the other variables are low (|r| < 0.2).
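The correlations quoted above can also be extracted directly; a minimal sketch, assuming the `learning2014` data frame from above (column 1 is the gender factor, so it is dropped):

```r
# correlation matrix of the numeric variables, rounded to two decimals
round(cor(learning2014[-1]), 2)

# e.g. the attitude-Points and deep-surf correlations discussed above
cor(learning2014$attitude, learning2014$Points)
cor(learning2014$deep, learning2014$surf)
```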

Next, I will fit a linear regression model with points as the dependent variable.

lm1 <- lm(Points ~ attitude + deep + stra, data = learning2014)
summary(lm1)
## 
## Call:
## lm(formula = Points ~ attitude + deep + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## deep         -0.7492     0.7507  -0.998  0.31974    
## stra          0.9621     0.5367   1.793  0.07489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

The intercept is statistically significant at the alpha level of 0.05 (p < 0.05). Attitude is a statistically significant predictor of Points, but deep and stra are not. I will drop deep and stra from the model.

lm2 <- lm(Points ~ attitude, data = learning2014)
summary(lm2)
## 
## Call:
## lm(formula = Points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

For every one-point increase in attitude, the student gains on average ~3.5 exam points. R-squared is ~0.19, which means that the model explains 19% of the variation in Points. Because attitude is the only explanatory variable in the model, it alone accounts for those 19%.
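The slope can be read off the fitted model directly. A small sketch, assuming the `lm2` object fitted above: the difference between the predicted Points of two students one attitude unit apart equals the attitude coefficient.

```r
# predicted exam points for attitude values 3 and 4
preds <- predict(lm2, newdata = data.frame(attitude = c(3, 4)))
preds
diff(preds)           # equals the slope, ~3.53
coef(lm2)["attitude"]
```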

Next, I will draw the diagnostic plots of the linear model I have just constructed (lm2).

plot(lm2, which = c(1,2,5))

The residuals vs fitted plot does not show any obvious patterns. This indicates that the model fits the data reasonably well and the data are homoscedastic.

In the normal Q-Q plot the dots lie reasonably close to the diagonal line. This means that the distribution of the residuals is reasonably close to the normal distribution. Therefore, the linear regression assumption of normally distributed residuals is not violated.

The residuals vs leverage plot does not identify any points with a high Cook's distance (no points fall beyond the dashed red line indicating a Cook's distance of 0.5). This means that there are no outliers with the potential to distort the regression model.
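The same conclusion can be checked numerically; a sketch assuming the `lm2` object from above:

```r
# Cook's distances for every observation in lm2
cd <- cooks.distance(lm2)
max(cd)              # should be well below the 0.5 threshold
which(cd > 0.5)      # expected: no influential observations flagged
```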


Chapter 3: Logistic regression

Import the data from my computer

alc <- read.csv("//atkk/home/p/pyrysipi/Documents/Tutkimus/Kurssit/Introduction to Open Data Science 2018/Data/alc.csv")
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob       <fct> teacher, other, other, services, other, other, othe...
## $ reason     <fct> course, course, other, home, home, reputation, home...
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian   <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...

The data

The data were collected from secondary school students in Portugal. They contain information on alcohol use and performance in mathematics and Portuguese, along with various background variables.

More information on the data can be found here: https://archive.ics.uci.edu/ml/datasets/Student+Performance

I will examine the relationships of sex, Medu (Mother’s education), Fedu (Father’s education) and famsize (family size) with high alcohol use.

My hypotheses are: 1) men have a higher probability of high alcohol use than women; 2) higher maternal education is associated with a lower probability of high alcohol use; 3) higher paternal education is associated with a lower probability of high alcohol use; and 4) larger family size is associated with a lower probability of high alcohol use.

Cross tabulations

table(sex = alc$sex, high_use = alc$high_use) %>% addmargins()
##      high_use
## sex   FALSE TRUE Sum
##   F     156   42 198
##   M     112   72 184
##   Sum   268  114 382
table(Medu = alc$Medu, high_use = alc$high_use) %>% addmargins()
##      high_use
## Medu  FALSE TRUE Sum
##   0       1    2   3
##   1      33   18  51
##   2      80   18  98
##   3      59   36  95
##   4      95   40 135
##   Sum   268  114 382
table(Fedu = alc$Fedu, high_use = alc$high_use) %>% addmargins()
##      high_use
## Fedu  FALSE TRUE Sum
##   0       2    0   2
##   1      53   24  77
##   2      75   30 105
##   3      72   27  99
##   4      66   33  99
##   Sum   268  114 382
table(Family_size = alc$famsize, high_use = alc$high_use) %>% addmargins()
##            high_use
## Family_size FALSE TRUE Sum
##         GT3   201   77 278
##         LE3    67   37 104
##         Sum   268  114 382

Plots

library(ggplot2)
g_sex <- ggplot(alc, aes(high_use, ..count..))
g_sex + geom_bar(aes(fill=sex), position = "dodge")

g_Medu <- ggplot(alc, aes(x = high_use, y = Medu))
g_Medu + geom_boxplot()

g_Fedu <- ggplot(alc, aes(x = high_use, y = Fedu))
g_Fedu + geom_boxplot()

g_famsize <- ggplot(alc, aes(x = high_use, ..count..))
g_famsize + geom_bar(aes(fill=famsize), position = "dodge")

Men seem to be high users of alcohol more often than women, and high use seems to be more common in small families (three or fewer members) than in large families (more than three members). Thus, my hypotheses 1 and 4 seem to be supported.

In contrast, maternal education does not seem to be associated with high alcohol use, and paternal education seems to be associated with a higher probability of high alcohol use. Thus, my hypotheses 2 and 3 seem to be false.
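The cross tabulations above are easier to compare as within-group proportions; a sketch, assuming the `alc` data frame from above:

```r
# share of high users within each sex (rows sum to 1)
prop.table(table(sex = alc$sex, high_use = alc$high_use), margin = 1)
# from the counts above: women 42/198 ~ 0.21, men 72/184 ~ 0.39

# share of high users within each family size
prop.table(table(famsize = alc$famsize, high_use = alc$high_use), margin = 1)
```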

Logistic regression

m <- glm(high_use ~ sex + Medu + Fedu + famsize, data=alc, family = "binomial")
m
## 
## Call:  glm(formula = high_use ~ sex + Medu + Fedu + famsize, family = "binomial", 
##     data = alc)
## 
## Coefficients:
## (Intercept)         sexM         Medu         Fedu   famsizeLE3  
##    -1.40340      0.85627     -0.06956      0.08132      0.29581  
## 
## Degrees of Freedom: 381 Total (i.e. Null);  377 Residual
## Null Deviance:       465.7 
## Residual Deviance: 449.2     AIC: 459.2

Male sex, higher father's education (per one-unit increase in education) and small family size (three or fewer members) have positive coefficients, indicating a positive association with high alcohol use.

Higher mother’s education (per one unit increase in education) has a negative coefficient indicating a negative association with high alcohol use.

OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.2457589 0.1194573 0.4895376
## sexM        2.3543727 1.4978628 3.7375354
## Medu        0.9328005 0.7093940 1.2255525
## Fedu        1.0847204 0.8293229 1.4235393
## famsizeLE3  1.3442147 0.8179982 2.1900766

Men have 2.4 times higher odds than women of being high alcohol users. The association is statistically significant (at alpha level 0.05), because the 95% confidence interval does not include 1. My hypothesis 1 seems to be correct.

Each one-unit increase in mother's education multiplies the odds of high alcohol use by 0.93 (i.e., lowers them by about 7%). Nevertheless, the association is not statistically significant (at alpha level 0.05), because the 95% confidence interval includes 1. My hypothesis 2 cannot be confirmed.

Each one-unit increase in father's education multiplies the odds of high alcohol use by 1.08 (i.e., raises them by about 8%). Nevertheless, the association is not statistically significant (at alpha level 0.05), because the 95% confidence interval includes 1. My hypothesis 3 cannot be confirmed.

Those from small families have 1.3 times higher odds than those from large families of being high alcohol users. Nevertheless, the association is not statistically significant (at alpha level 0.05), because the 95% confidence interval includes 1. My hypothesis 4 cannot be confirmed.
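As a sanity check, the unadjusted odds ratio for sex can be computed by hand from the 2x2 cross tabulation shown earlier (72 high users out of 184 men, 42 out of 198 women):

```r
# odds of high use for men and women, and their ratio
odds_men   <- 72 / 112   # high users / non-high users among men
odds_women <- 42 / 156   # high users / non-high users among women
odds_men / odds_women    # ~2.39, close to the adjusted OR of 2.35 above
```

The small difference from the model-based OR comes from adjusting for the other covariates in the logistic regression.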

Predictive power of the model

Only sex will be included in the predictive model because it was the only variable with a statistically significant association with high alcohol use.

# fit the model
m <- glm(high_use ~ sex, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, sex, high_use, probability, prediction) %>% tail(10)
##     sex high_use probability prediction
## 373   M    FALSE   0.3913043      FALSE
## 374   M     TRUE   0.3913043      FALSE
## 375   F    FALSE   0.2121212      FALSE
## 376   F    FALSE   0.2121212      FALSE
## 377   F    FALSE   0.2121212      FALSE
## 378   F    FALSE   0.2121212      FALSE
## 379   F    FALSE   0.2121212      FALSE
## 380   F    FALSE   0.2121212      FALSE
## 381   M     TRUE   0.3913043      FALSE
## 382   M     TRUE   0.3913043      FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE
##    FALSE   268
##    TRUE    114
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
##         prediction
## high_use     FALSE
##    FALSE 0.7015707
##    TRUE  0.2984293
#plot the predictions and actual values

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2984293

The model predicts that no one is a high alcohol user. This is wrong in 30% of the cases. The model still performs better than flipping a coin, a strategy that would be wrong in about 50% of the cases.
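The training error reported by the loss function is one minus the classification accuracy; a sketch, assuming the `alc` columns created above:

```r
# proportion of correctly classified students (accuracy)
mean(alc$high_use == alc$prediction)

# the always-FALSE prediction gets every non-high-user right
# and every high user wrong, so accuracy equals the share of non-high-users
mean(!alc$high_use)
```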

Bonus:

10-fold cross validation

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2984293

The model has a prediction error of 0.30. I'll try to make a better one by including the same variables as in the DataCamp exercise and a couple of extra variables:

# fit the model
m <- glm(high_use ~ school + sex + age + address + famsize + failures + absences, data = alc, family = "binomial")
m
## 
## Call:  glm(formula = high_use ~ school + sex + age + address + famsize + 
##     failures + absences, family = "binomial", data = alc)
## 
## Coefficients:
## (Intercept)     schoolMS         sexM          age     addressU  
##    -3.18210      0.19912      0.92655      0.09062     -0.40680  
##  famsizeLE3     failures     absences  
##     0.30259      0.42108      0.09196  
## 
## Degrees of Freedom: 381 Total (i.e. Null);  374 Residual
## Null Deviance:       465.7 
## Residual Deviance: 419   AIC: 435
# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
summary(alc$probability)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.09712 0.17396 0.26337 0.29843 0.38262 0.93004
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252   16
##    TRUE     79   35
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
##         prediction
## high_use      FALSE       TRUE
##    FALSE 0.65968586 0.04188482
##    TRUE  0.20680628 0.09162304
#plot the predictions and actual values

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2486911
#perform 10-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2617801

Now the prediction error is ~0.26. It seems that I cannot beat DataCamp.
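Because the ten folds are drawn at random, cv.glm() gives a slightly different error estimate on every run. A small sketch (assuming `alc`, `loss_func` and the model `m` from above) of repeating the cross-validation to gauge that variability:

```r
library(boot)

# repeat 10-fold CV five times and look at the spread of the error estimates
set.seed(123)
cv_errors <- replicate(5, cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)$delta[1])
cv_errors
summary(cv_errors)
```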


title: "RStudio Exercise 4"
author: "Pyry Sipilä"
date: "23 November 2018"
output:
  html_document:
    toc: TRUE
# Chapter 4: Clustering and classification

### Analysis exercises

### 1

A new markdown file has been created.

### 2

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

dim(Boston)
## [1] 506  14

colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Boston is a classic dataset describing living conditions, such as the urban environment, location, crime rate, air quality and economic conditions, in towns in the suburbs of Boston. It has 506 observations and 14 variables. More information on the data can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

### 3

#install.packages("tidyverse")
library(tidyverse)
## -- Attaching packages --------------------------------------------------- tidyverse 1.2.1 --
## v tibble  1.4.2     v purrr   0.2.5
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.2.1     v forcats 0.3.0
## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x MASS::select()  masks dplyr::select()

#install.packages("corrplot")
library(corrplot)
## corrplot 0.84 loaded

#histograms of all variables
Boston %>% gather() %>% ggplot(aes(value)) + facet_wrap(~ key, scales = "free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#scatter plot of all variables in the dataset
plot(Boston)

# calculate the correlation matrix, round it and print it
cor_matrix <- cor(Boston)
cor_matrix <- cor_matrix %>% round(digits = 2)
cor_matrix
##         crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim    1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn     -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus   0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas   -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox     0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm     -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age     0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis    -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad     0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax     0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio 0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black  -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat   0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv   -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

# visualize the correlation matrix
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Many variables have skewed distributions. rm (average number of rooms per dwelling) has a distribution close to normal.

There are quite a few strong correlations. The strongest positive correlation is between tax (full-value property-tax rate per $10,000) and rad (index of accessibility to radial highways) (0.91) and the strongest negative correlation is between dis (weighted mean of distances to five Boston employment centres) and nox (nitrogen oxides concentration (parts per 10 million)) (-0.77).
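Instead of scanning the matrix by eye, the extreme correlations can be located programmatically, using the Boston data loaded above:

```r
library(MASS)
data("Boston")

# zero out the diagonal and look up the extreme off-diagonal correlations
cors <- cor(Boston)
diag(cors) <- 0
max(cors)                                  # ~0.91
which(cors == max(cors), arr.ind = TRUE)   # tax & rad
min(cors)                                  # ~-0.77
which(cors == min(cors), arr.ind = TRUE)   # nox & dis
```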

### 4

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat             medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical variable to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset
n <- nrow(Boston)

# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set
test <- boston_scaled[-ind,]

All the scaled variables have mean zero (and standard deviation 1).
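scale() subtracts each column's mean and divides by its standard deviation, so the standardization can be verified directly on the Boston data:

```r
library(MASS)
data("Boston")
scaled <- scale(Boston)

# column means should be (numerically) zero and standard deviations one
all(abs(colMeans(scaled)) < 1e-10)           # TRUE
all(abs(apply(scaled, 2, sd) - 1) < 1e-10)   # TRUE
```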

### 5

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2623762 0.2475248 0.2376238 0.2524752 
## 
## Group means:
##                   zn      indus        chas        nox         rm
## low       0.97321171 -0.9185244 -0.08661679 -0.9025357  0.4455680
## med_low  -0.09448562 -0.3111762 -0.03610305 -0.5913655 -0.1447935
## med_high -0.36932806  0.1692078  0.17879700  0.4311546  0.1109409
## high     -0.48724019  1.0171096 -0.04073494  1.1048757 -0.4385942
##                 age        dis        rad        tax     ptratio
## low      -0.8884503  0.9000732 -0.6795860 -0.7458263 -0.47056202
## med_low  -0.4155018  0.4194306 -0.5454537 -0.4931682 -0.09632279
## med_high  0.3638060 -0.3808934 -0.4303678 -0.3250253 -0.27440656
## high      0.8066568 -0.8701892  1.6382099  1.5141140  0.78087177
##                black       lstat        medv
## low       0.37509939 -0.77605225  0.54141584
## med_low   0.31881277 -0.13375648 -0.01248288
## med_high  0.08468131  0.02531709  0.18988155
## high     -0.80127687  0.92649425 -0.69954484
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.06072276  0.54557584 -1.01012548
## indus    0.05913640 -0.21908236  0.41111233
## chas    -0.09006277  0.01171516  0.11268488
## nox      0.41954339 -1.00411213 -1.21957049
## rm      -0.07234874 -0.14395445 -0.15378878
## age      0.16288020 -0.16612846 -0.10289718
## dis     -0.03021468 -0.14910299  0.29269265
## rad      3.31526873  1.01657715  0.03380556
## tax      0.02854150  0.15942040  0.37488995
## ptratio  0.13683350 -0.16605463 -0.33209169
## black   -0.11320925  0.03537380  0.13047343
## lstat    0.24958662 -0.30171765  0.30264539
## medv     0.20595209 -0.43362517 -0.21382427
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9503 0.0371 0.0126

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, x1 = myscale * heads[,choices[1]], y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), cex = tex, col = color, pos = 3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit)

### 6

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14       4        3    0
##   med_low    5      15        6    0
##   med_high   0       7       21    2
##   high       0       0        0   25

The proportion of correct predictions is 14/21 (67%) for the low crime-rate category, 15/26 (58%) for med_low, 21/30 (70%) for med_high and 25/25 (100%) for high. Overall, the predictions are quite good (75/102 ≈ 74% correct), and they are clearly best for the highest crime-rate category.
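The overall proportion of correct predictions is the sum of the diagonal of the cross tabulation divided by the number of test observations; a sketch, assuming `correct_classes` and `lda.pred` from above:

```r
# overall classification accuracy on the test set
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)

# equivalently, one minus the misclassification rate
mean(correct_classes != lda.pred$class)
```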

### 7

#Reload and scale the Boston data
data("Boston")
boston_scaled <- scale(Boston)

#set seed for reproducibility
set.seed(123)

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

# k-means clustering
km <- kmeans(boston_scaled, centers = 3)

#visualize it
pairs(Boston, col = km$cluster)

##investigate the optimal number of clusters
#set seed for reproducibility
set.seed(123)

# determine the maximum number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

#2 seems to be an optimal number of clusters, because the total within sum of
#squares drops dramatically when moving from one cluster to two clusters.

#run the algorithm again
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

Based on the total within-cluster sum of squares, two is the optimal number of clusters. The two clusters seem to differentiate well between towns with low and high crime rates; the other variables do not separate the clusters as clearly.
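The claim that the two clusters separate on crime rate can be checked by comparing cluster means; a sketch, assuming `boston_scaled` and the two-cluster `km` object from the chunk above:

```r
# mean scaled crime rate within each of the two clusters:
# a clear difference supports the interpretation above
tapply(boston_scaled[, "crim"], km$cluster, mean)
```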

### Bonus

```r
# reload and scale the Boston data
data("Boston")
boston_scaled <- scale(Boston)
# change the data to a data frame
boston_scaled <- as.data.frame(boston_scaled)

# k-means clustering
km <- kmeans(boston_scaled, centers = 3)

# add the clusters to the data
boston_scaled <- data.frame(boston_scaled, km$cluster)
glimpse(boston_scaled)
```

## Observations: 506
## Variables: 15
## $ crim       <dbl> -0.4193669, -0.4169267, -0.4169290, -0.4163384, -0....
## $ zn         <dbl> 0.28454827, -0.48724019, -0.48724019, -0.48724019, ...
## $ indus      <dbl> -1.2866362, -0.5927944, -0.5927944, -1.3055857, -1....
## $ chas       <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0....
## $ nox        <dbl> -0.1440749, -0.7395304, -0.7395304, -0.8344581, -0....
## $ rm         <dbl> 0.4132629, 0.1940824, 1.2814456, 1.0152978, 1.22736...
## $ age        <dbl> -0.11989477, 0.36680343, -0.26554897, -0.80908783, ...
## $ dis        <dbl> 0.1400749840, 0.5566090496, 0.5566090496, 1.0766711...
## $ rad        <dbl> -0.9818712, -0.8670245, -0.8670245, -0.7521778, -0....
## $ tax        <dbl> -0.6659492, -0.9863534, -0.9863534, -1.1050216, -1....
## $ ptratio    <dbl> -1.4575580, -0.3027945, -0.3027945, 0.1129203, 0.11...
## $ black      <dbl> 0.4406159, 0.4406159, 0.3960351, 0.4157514, 0.44061...
## $ lstat      <dbl> -1.07449897, -0.49195252, -1.20753241, -1.36017078,...
## $ medv       <dbl> 0.15952779, -0.10142392, 1.32293748, 1.18158864, 1....
## $ km.cluster <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...

```r
# perform LDA using the clusters as target classes
lda.fit <- lda(km.cluster ~ ., data = boston_scaled)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, x1 = myscale * heads[,choices[1]], y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), cex = tex, col = color, pos = 3)
}

# plot the lda results, coloring the points by cluster
classes <- boston_scaled$km.cluster
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit)
```

Rad (index of accessibility to radial highways) is the most influential linear separator for the clusters. Other influential linear separators are tax (full-value property-tax rate per $10,000), age (proportion of owner-occupied units built prior to 1940) and dis (weighted mean of distances to five Boston employment centres).
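This visual impression can be checked numerically from the discriminant coefficients. A sketch, assuming `lda.fit` is the model fitted on the k-means clusters above:

```r
# sum of absolute coefficients over the discriminants;
# larger values indicate a more influential linear separator
sep <- rowSums(abs(coef(lda.fit)))
sort(sep, decreasing = TRUE)
```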

### Super bonus

```r
lda.fit <- lda(crime ~ ., data = train)

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
```

## [1] 404 13

```r
dim(lda.fit$scaling)
```

## [1] 13 3

```r
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

# install.packages("plotly")
library(plotly)
```

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:MASS':
## 
##     select

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

```r
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = train$crime)
```

```r
train2 <- train
train2$num_crime <- as.numeric(train2$crime)
train2$num_crime <- scale(train2$num_crime)
train2$crime <- train2$num_crime
train2 <- select(train2, -num_crime)
km <- kmeans(train2, centers = 2)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = km$cluster)
```

There is an obvious difference: crime has four classes, but k-means produced only two clusters. Apart from that, the plots look surprisingly similar. The clusters formed by k-means clustering seem mostly to differentiate the towns with a high crime rate from the towns with lower crime rates.
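The similarity can also be quantified by cross-tabulating the k-means clusters against the original crime classes. A sketch, assuming `train` and `km` from the chunks above:

```r
# how the two k-means clusters map onto the four crime classes;
# if the plots agree, the high crime class should fall almost entirely into one cluster
table(crime = train$crime, cluster = km$cluster)
```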


### 1

```r
library(GGally)
ggpairs(human_)

# calculate the correlation matrix, round it and print it
cor_matrix <- cor(human_)
cor_matrix <- cor_matrix %>% round(digits = 2)
cor_matrix
```
##           Edu2.FM Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM      1.00    0.01    0.59     0.58  0.18   -0.66     -0.53    0.08
## Labo.FM      0.01    1.00    0.05    -0.14 -0.01    0.24      0.12    0.25
## Edu.Exp      0.59    0.05    1.00     0.79  0.12   -0.74     -0.70    0.21
## Life.Exp     0.58   -0.14    0.79     1.00  0.19   -0.86     -0.73    0.17
## GNI          0.18   -0.01    0.12     0.19  1.00   -0.10     -0.13    0.00
## Mat.Mor     -0.66    0.24   -0.74    -0.86 -0.10    1.00      0.76   -0.09
## Ado.Birth   -0.53    0.12   -0.70    -0.73 -0.13    0.76      1.00   -0.07
## Parli.F      0.08    0.25    0.21     0.17  0.00   -0.09     -0.07    1.00
```r
# visualize the correlation matrix
library(corrplot)
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
```

Maternal mortality has an especially skewed distribution. Life expectancy and expected years of education are highly correlated, and so are maternal mortality and adolescent births. Life expectancy and maternal mortality are highly negatively correlated.
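Instead of scanning the matrix by eye, the strongest pairwise correlations can also be extracted programmatically. A sketch, assuming the `cor_matrix` object from above:

```r
# turn the correlation matrix into a table of variable pairs,
# keeping each pair only once (upper triangle)
cor_pairs <- as.data.frame(as.table(cor_matrix))
cor_pairs <- cor_pairs[as.integer(cor_pairs$Var1) < as.integer(cor_pairs$Var2), ]

# the five most strongly correlated pairs (by absolute value)
head(cor_pairs[order(-abs(cor_pairs$Freq)), ], 5)
```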

### 2 Principal component analysis (unscaled)

```r
# unscaled data
pca_human <- prcomp(human_)

# create and print out a summary of pca_human
s <- summary(pca_human)
s
```
## Importance of components:
##                             PC1      PC2      PC3      PC4     PC5     PC6
## Standard deviation     214.3202 54.48589 26.38141 11.47911 4.06687 1.60671
## Proportion of Variance   0.9233  0.05967  0.01399  0.00265 0.00033 0.00005
## Cumulative Proportion    0.9233  0.98298  0.99697  0.99961 0.99995 1.00000
##                           PC7    PC8
## Standard deviation     0.1905 0.1587
## Proportion of Variance 0.0000 0.0000
## Cumulative Proportion  1.0000 1.0000
```r
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

# print out the percentages of variance
pca_pr
```
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 92.3  6.0  1.4  0.3  0.0  0.0  0.0  0.0
```r
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.5, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], sub = "PC1 reflects maternal mortality and PC2 gross national income")
```
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

### 3 Principal component analysis (scaled)

```r
human_std <- scale(human_)
pca_human_std <- prcomp(human_std)

# create and print out a summary of pca_human_std
s_std <- summary(pca_human_std)
s_std
```
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.966 1.1388 0.9900 0.86598 0.69931 0.54001 0.46701
## Proportion of Variance 0.483 0.1621 0.1225 0.09374 0.06113 0.03645 0.02726
## Cumulative Proportion  0.483 0.6452 0.7677 0.86140 0.92253 0.95898 0.98625
##                            PC8
## Standard deviation     0.33172
## Proportion of Variance 0.01375
## Cumulative Proportion  1.00000
```r
# rounded percentages of variance captured by each PC
pca_pr_std <- round(100*s_std$importance[2, ], digits = 1)

# print out the percentages of variance
pca_pr_std
```
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 48.3 16.2 12.3  9.4  6.1  3.6  2.7  1.4
```r
# create object pc_lab_std to be used as axis labels
pc_lab_std <- paste0(names(pca_pr_std), " (", pca_pr_std, "%)")

# draw a biplot
biplot(pca_human_std, cex = c(0.5, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_std[1], ylab = pc_lab_std[2], sub = "PC1 reflects maternal mortality and adolescent births and PC2 women in labor force and parliament")
```

With standardized variables, the first principal component captures much less of the variance. That is because after standardization all variables have the same variance, which prevents variables with a high variance from dominating the principal component analysis. For the same reason, with standardization PC2 reflects women in the labor force and parliament, whereas without standardization it reflects gross national income.
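The effect of standardization is easy to verify from the raw variances. A sketch, assuming the `human_` data frame from above:

```r
# the raw variances differ by orders of magnitude, so the unscaled PCA
# is dominated by the variables with the largest variance
round(apply(human_, 2, var), 1)

# after scaling, every variable has variance 1
round(apply(scale(human_), 2, var), 1)
```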

### 4

In the PCA with standardized data, PC1 reflects maternal mortality and adolescent births, and PC2 women in the labor force and parliament.

### 5

```r
library(FactoMineR)
tea <- read.csv("http://factominer.free.fr/course/doc/data_MCA_Tea.csv", sep = ";", header = TRUE)
head(tea)
```
##       breakfast     tea.time     evening     lunch     dinner     always
## 1     breakfast Not.tea time Not.evening Not.lunch Not.dinner Not.always
## 2     breakfast Not.tea time Not.evening Not.lunch Not.dinner Not.always
## 3 Not.breakfast     tea time     evening Not.lunch     dinner Not.always
## 4 Not.breakfast Not.tea time Not.evening Not.lunch     dinner Not.always
## 5     breakfast Not.tea time     evening Not.lunch Not.dinner     always
## 6 Not.breakfast Not.tea time Not.evening Not.lunch     dinner Not.always
##   home     work     tearoom     friends     resto     pub       Tea   How
## 1 home Not.work Not.tearoom Not.friends Not.resto Not.pub     black alone
## 2 home Not.work Not.tearoom Not.friends Not.resto Not.pub     black  milk
## 3 home     work Not.tearoom     friends     resto Not.pub Earl Grey alone
## 4 home Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone
## 5 home Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone
## 6 home Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone
##      sugar     how       where           price age sex          SPC
## 1    sugar tea bag chain store       p_unknown  39   M       middle
## 2 No.sugar tea bag chain store      p_variable  45   F       middle
## 3 No.sugar tea bag chain store      p_variable  47   F other worker
## 4    sugar tea bag chain store      p_variable  23   M      student
## 5 No.sugar tea bag chain store      p_variable  48   M     employee
## 6 No.sugar tea bag chain store p_private label  21   M      student
##           Sport age_Q frequency     escape.exoticism     spirituality
## 1     sportsman 35-44     1/day Not.escape-exoticism Not.spirituality
## 2     sportsman 45-59     1/day     escape-exoticism Not.spirituality
## 3     sportsman 45-59    +2/day Not.escape-exoticism Not.spirituality
## 4 Not.sportsman 15-24     1/day     escape-exoticism     spirituality
## 5     sportsman 45-59    +2/day     escape-exoticism     spirituality
## 6     sportsman 15-24     1/day Not.escape-exoticism Not.spirituality
##       healthy     diuretic     friendliness     iron.absorption
## 1     healthy Not.diuretic Not.friendliness Not.iron absorption
## 2     healthy     diuretic Not.friendliness Not.iron absorption
## 3     healthy     diuretic     friendliness Not.iron absorption
## 4     healthy Not.diuretic Not.friendliness Not.iron absorption
## 5 Not.healthy     diuretic     friendliness Not.iron absorption
## 6     healthy Not.diuretic Not.friendliness Not.iron absorption
##       feminine     sophisticated    slimming    exciting    relaxing
## 1 Not.feminine Not.sophisticated No.slimming No.exciting No.relaxing
## 2 Not.feminine Not.sophisticated No.slimming    exciting No.relaxing
## 3 Not.feminine Not.sophisticated No.slimming No.exciting    relaxing
## 4 Not.feminine     sophisticated No.slimming No.exciting    relaxing
## 5 Not.feminine Not.sophisticated No.slimming No.exciting    relaxing
## 6 Not.feminine Not.sophisticated No.slimming No.exciting    relaxing
##      effect.on.health
## 1 No.effect on health
## 2 No.effect on health
## 3 No.effect on health
## 4 No.effect on health
## 5 No.effect on health
## 6 No.effect on health
```r
str(tea)
```
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
```r
dim(tea)
```
## [1] 300  36
```r
summary(tea)
```
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   +60  :38   +2/day     :127  
##  middle      :40   sportsman    :179   15-24:92   1 to 2/week: 44  
##  non-worker  :64                       25-34:69   1/day      : 95  
##  other worker:20                       35-44:40   3 to 6/week: 34  
##  senior      :35                       45-59:61                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 
```r
plot <- gather(tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")
```

## Warning: attributes are not identical across measure variables;
## they will be dropped

```r
plot + geom_bar()
```

```r
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
summary(tea_time)
```
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
```r
str(tea_time)
```
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
```r
# visualize the dataset
plot <- gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")
```

## Warning: attributes are not identical across measure variables;
## they will be dropped

```r
plot + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
```

```r
# multiple correspondence analysis
mca <- MCA(tea_time)

# summary of the model
summary(mca)
```
## 
## Call:
## MCA(X = tea_time) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
```r
# visualize MCA
plot(mca, invisible = c("ind"), habillage = "quali")
```

Dim 1 seems to capture those who like tea shops and unpackaged tea; probably people who want to enjoy high-quality tea in a good atmosphere. Dim 2 seems to capture those who put something else ("other") in their tea. I wonder what that may be.
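FactoMineR can also describe the dimensions directly, which helps with this kind of interpretation. A sketch, assuming the `mca` object fitted above:

```r
# categories and variables most strongly associated with the first two dimensions
dd <- dimdesc(mca, axes = 1:2)
dd$`Dim 1`
dd$`Dim 2`
```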